<!DOCTYPE html>
<html>
<head>

<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">

<link rel="stylesheet" href="http://netdna.bootstrapcdn.com/bootstrap/3.0.3/css/bootstrap.min.css">
<link rel="stylesheet" href="http://netdna.bootstrapcdn.com/bootstrap/3.0.3/css/bootstrap-responsive.min.css">

<style type="text/css">
.nav { }
.nav li { float: left; width: 110px; }
.container { background-color: #ffffff; padding: 30px; }
.content { padding: 30px; }
body, h1, h2, h3, h4 { font-family: "Trebuchet MS","Helvetica Neue",Arial,Helvetica,sans-serif,"Georgia";}
body { background-color: #eeeeee; }
header { width: 100%; }
blockquote p { font-size: 14px; }
</style>
<title>ZPar | Phrase-Structure Parsing</title>
</head>

<body>
<div class="table-bordered container">
<div class="content">

<header>
<div class="page-header text-center">
<h1>Phrase-Structure Parsing</h1>
</div>
</header>


<h2><a id="How-to-compile">How to compile</a></h2>
<p>
Suppose that ZPar has been downloaded to the directory <code>zpar</code>. To make a phrase-structure parsing system for English, type <code>make english.conparser</code>. This will create a directory <code>zpar/dist/english.conparser</code>, in which there are two files: <code>train</code> and <code>conparser</code>. The file <code>train</code> is used to train a parsing model,and the file <code>conparser</code> is used to parse new texts using a trained parsing model. Similarly, we can make a phrase-structure parsing system for Chinese by typing <code>make chinese.conparser</code>. The <code>train</code> and <code>conparser</code> files are created under the directory of <code>zpar/dist/chinese.conparser</code>. Note that the English and Chinese parsers are designed specifically for <a href="http://www.cis.upenn.edu/~treebank/">Penn Treebanks</a>.
</p>

<h2><a id="Format-of-inputs-and-outputs">Format of inputs and outputs</a></h2>
<p>
The input file to the <code>train</code> executable contains a set of parse trees, one for each line. An example parse tree is as follows:
</p>
<pre><code> ( S r ( NP r ( NNP t Ms. )  ( NNP t Haag )  )  ( S l* ( VP l ( VBZ t plays )  ( NP s ( NNP t Elianti )  )  )  ( . t . )  )  )</code></pre>
<p>
The format is different from the original format used in Penn Treebanks. Here is a <a href="con_files/binarize.py">Python script</a> to convert the original Penn Treebank format to the ZPar format. The usage is
</p>
<pre><code> python binarize.py &lt;rule-file&gt; &lt;input-file&gt; </code></pre>
<p>
Here <code>rule-file</code> is a file containing head-finding rules (see the <a href="con_files/rule.txt">example rules</a> for Penn Chinese Treebank), and the conversion results will be printed to the console. Note that, in the respect of Chinese, the encoding of input-file to <code>binarize.py</code> should be <code>gb</code> and the output will be encoded in <code>utf8</code>. Here is a <a href="seg_files/gb2utf.py">script</a> that transfers files that are encoded in <code>gb</code> to the <code>utf8</code> encoding.
</p>
<p>
The input file to the <code>conparser</code> contain POS tagged sentences. The formats for English and Chinese are different.
</p>
 <strong>English</strong>: 
<br/>
<pre><code> Ms./NNP Haag/NNP plays/VBZ Elianti/NNP</code></pre>
<strong>Chinese</strong>: 
<br/>
<pre><code> ZPar_NR 可以_MD 分析_VV 中文_NN 和_CC 英文_NN </code></pre>
<p>
For Chinese, inputs to both <code>train</code> and <code>conparser</code> must be encoded in <code>utf8</code>.
</p>
<h2><a id="How-to-train-a-model">How to train a model</a></h2>
<p>
To train a model, use
</p>
<pre><code> zpar/dist/english.conparser/train &lt;train-file&gt; &lt;model-file&gt; &lt;number of iterations&gt; </code></pre>
<p>
For example, using the <a href="con_files/train.txt">example train file</a>, you can train a  model by
</p>
<pre><code> zpar/dist/english.conparser/train train.txt model 1 </code></pre>
<p>
After training is completed, a new file <code>model</code> will be created in the current directory, which can be used to parse POS-tagged sentences. The above command performs training with one iteration (see <a href="#tuning">How to tune the performance of a system</a>) using the training file. The commands for training Chinese parsing models are the same.
</p>
<h2><a id="How-to-parse-new-texts">How to parse new texts</a></h2>
<p>
To apply an existing model to parse new texts, use
</p>
<pre><code> zpar/dist/english.conparser/conparser &lt;input-file&gt; &lt;output-file&gt;  &lt;model&gt;</code></pre>
<p>
For example, using the model we just trained, we can parse <a href="con_files/input.txt">an example input</a> by
</p>
<pre><code> zpar/dist/english.conparser/conparser input.txt output.txt model</code></pre>
<p>
The output file contains automatically parsed trees. The commands for parsing Chinese texts are the same.
</p>
<h2><a id="Outputs-and-evaluation">Outputs and evaluation</a></h2>
<p>
In order to evaluate the quality of the outputs, we can manually specify the gold parse trees of a sample, and compare the outputs with the correct sample.
</p>
<p>
Manually specified parse trees of the input file are given in <a href="con_files/reference.txt">this example reference file</a>. Refer to <a href="http://nlp.cs.nyu.edu/evalb/">evalb</a> to obtain a software that performs automatic evaluation.
</p>
<p>
Using the above <code>output.txt</code> and <code>reference.txt</code>, we can evaluate the accuracies by typing
</p>
<pre><code> ./evalb -p &lt;config.file&gt; output.txt reference.txt</code></pre>
<p>
Here <code>config.file</code> sets running parameters of the evaluation. <a href="con_files/COLLINS.prm">COLLINS.prm</a> is a widely used configuration file.
Evaluation results will be printed to the console.
</p>
<h2><a id="tuning">How to tune the performance of a system</a></h2>
<p>
The performance of the system after one training iteration may not be optimal. You can try training a model for another few iterations, after each you compare the performance. You can choose the model that gives the highest f-score on your test data. We conventionally call this test file the development test data, because you develop a parsing model using this. Here is a <a href="con_files/test.sh">a shell script</a> that automatically trains the parser for 30 iterations, and after the ith iteration, stores the model file to model.i. You can compare the f-score of all 30 iterations and choose model.k, which gives the best f-score, as the final model. In this file, this is a variable called <code>parser</code>. You need to set this variable to the relative directory of <code>zpar/dist/english.conparsr</code> or <code>zpar/dist/chinese.conparser</code>.
</p>
<h2><a id="Source-code">Source code</a></h2>
<p>
The source code for the English phrase-structure parser can be found at
</p>
<pre><code> zpar/src/common/conparser/implementation/ENGLISH_CONPARSER_IMPL</code></pre>
<p>
where <code>ENGLISH_CONPARSER_IMPL</code> is a macro defined in <code>Makefile</code>, and specifies a specific implementation for the English phrase-structure parser.
</p>
<p>
The source code for the Chinese phrase-structure parser can be found at
</p>
<pre><code> zpar/src/common/conparser/implementation/CHINESE_CONPARSER_IMPL</code></pre>

<p>
where <code>CHINESE_CONPARSER_IMPL</code> is a macro defined in <code>Makefile</code>, and specifies a specific implementation for the Chinese phrase-structure parser.
</p>
<h2><a id="reference">Reference</a></h2>
<ul>
<li>
Yue Zhang and Stephen Clark. 2009. Transition-Based Parsing of the Chinese Treebank using a Global Discriminative Model. In <em>Proc. of IWPT</em>, pages 162-171.
</li>
<li>
Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang and Jingbo Zhu. 2013. Fast and Accurate Shift-Reduce Constituent Parsing. In <em>Proc. of ACL</em> pages 434-443.
</li>
</ul>

</div>
</div>
<footer class="text-center">
<p>
ZPar Release 0.7
</p>
</footer>
</body>
</html>
